Abstract

Real-time  programmable  graphics  hardware  has  resource  constraints  that  prevent  complex  shaders  from  rendering 
in  a  single  pass.  One  way  to  virtualize  these  resources  is  to  partition  shading  computations  into  multiple  passes, 
each  of  which  satisfies  the  given  constraints.  Many  such  partitions  exist  for  a  shader,  but  it  is  important  to  find 
one  that  renders  efficiently.  We  present  Recursive  Dominator  Split  (RDS),  a  polynomial-time  algorithm  that  uses 
a cost model to find near-optimal partitions of arbitrarily complex shaders. Using a simulator, we analyze partitions
for architectures with different resource constraints and show that RDS performs well on different graphics
architectures.  We  also  demonstrate  that  shader  partitions  computed  by  RDS  can  run  efficiently  on  programmable 
graphics  hardware  available  today. 

Categories and Subject Descriptors: I.3.1 [Computer Graphics]: Graphics processors; G.2.2 [Mathematics of Computing]:
Graph Algorithms, Trees

Keywords: programmable graphics hardware, multipass rendering, graph partitioning algorithms, shading languages


1. Introduction

Real-time programmable shading using mainstream graphics
hardware has been an active area of research in recent
years. Earlier generations of graphics hardware provided
fixed-function pipelines designed for rendering
texture-mapped triangles. In contrast, commodity graphics chips
today support complex, user-programmable shading while
maintaining high performance. This flexibility has encouraged
the development of real-time shading languages that
target these chips.

Shading languages evolved from the early work of Cook
and Perlin. The RenderMan Shading Language is commonly
used today for movie production-quality shading in
software rendering systems. Olano and Lastra described
pfman, the first shading language that targets graphics
hardware for real-time rendering. Peercy et al. proposed a
method for mapping shading languages to multiple
rendering passes on non-programmable commodity graphics
hardware. Proudfoot et al. described a system that maps a
shading language to programmable graphics hardware using
a retargetable compiler back end. These last three systems
are able to render high-quality shaders in real-time.

Graphics chips today provide user-programmable
pipelines. These longer pipelines accommodate larger
shaders with more sophisticated operations. However, this
hardware still has a limited set of resources. Examples of
such limits are:

•  A fixed memory size for instruction storage, i.e. a maximum
number of instructions.

•  A  fixed  number  of  active  textures,  texture  accesses,  and 
texture  dependencies. 

•  A  fixed  number  of  registers  for  storing  temporary  values. 

•  A fixed number of interpolants for storing
vertex-to-fragment values.

With  shading  languages,  it  is  easy  to  write  large  shaders  that 
exhaust available resources and cannot be mapped to a single
rendering pass. The hardware can be virtualized by
partitioning the shader into multiple passes, where each pass is
a  subset  of  the  entire  computation  that  satisfies  all  resource 
constraints.  Many  such  partitions  exist,  and  it  is  desirable  to 
find  the  one  that  renders  most  efficiently.  We  call  this  the 
Multipass  Partitioning  Problem  (MPP). 

Peercy et al. solved this problem for non-programmable
graphics hardware by using dynamic programming. Their
system internally represents shading computations as directed
acyclic graphs (DAGs). They first decompose a DAG into
trees, then perform tree-matching to find a minimum-cost set
of passes. This approach works well for non-programmable
hardware, which supports only a small set of operations
per pass. Proudfoot et al. observed, however, that
tree-matching techniques are inadequate for mapping DAGs to
programmable hardware. To address this issue, they
developed a back end for their shading system specifically to
target this hardware. However, they did not solve MPP for
programmable hardware, so their back end can only handle
shaders that map to a single rendering pass.

In  this  paper,  we  present  Recursive  Dominator  Split 
(RDS),  an  algorithm  that  solves  MPP  for  programmable 
graphics hardware. RDS uses dominator trees, a heuristic
search, and a greedy merging strategy to approximate
minimum-cost  partitions.  Using  a  simulator,  we  show  that 
RDS  finds  partitions  within  5%  of  optimal  for  different 
shaders  on  architectures  with  different  resource  constraints. 
We  also  demonstrate  how  RDS  can  be  used  with  an  existing 
shading  system  to  partition  and  render  a  multipass  shader  on 
a  programmable  graphics  card. 

2.  Overview 

We designed RDS in the context of the Stanford Real-Time
Programmable Shading System. This system is illustrated
in Figure 1. A shader enters the system as source code
written in a high-level language. The compiler front end parses
this code and generates an intermediate pipeline program
split by computation frequency. A compiler back end maps
the pipeline program to hardware rendering passes. In this
paper we discuss back end modules that map fragment
computations to multiple passes. Each back end performs
instruction selection on the fragment portion of the pipeline
program and builds a DAG of hardware-specific fragment
operations. If the back end cannot map the DAG to a single
rendering pass, it calls RDS to partition the DAG into
multiple passes.

There  are  many  possible  partitions,  so  RDS  uses  a  cost 
model  to  evaluate  them.  The  model  reflects  performance 
characteristics  of  the  target  architecture,  such  as  the  cost  of 
operations and per-pass overhead. The goal is to find the optimal
solution to MPP, i.e. a partition with the minimum cost.

MPP is related to some well-studied graph partitioning
and NP-optimization problems. However, MPP is sufficiently
different that, as far as we have been able to determine,
existing techniques cannot be easily adapted to solve
it. For example, the problem of load balancing parallel
applications can be formulated in terms of finding a balanced
subdivision of a graph, but algorithms to solve this problem
are not immediately applicable to MPP. This is because in
the load-balancing setting the number of desired graph
cuts is fixed, whereas in MPP it is part of the solution. More
fundamentally, graph partitioning algorithms tend to assume
that partitions are disjoint, whereas MPP allows partitions
with overlapping subregions. In fact, MPP solutions almost
always contain overlapping regions, which correspond to
recomputed operations.


Figure 1: System block diagram. The compiler front end converts a
shader from shading language source code to an intermediate
representation. The compiler back end performs instruction selection to
build a DAG of hardware-specific operations. If the DAG is too large
to be mapped to a single pass, RDS partitions the DAG into multiple
passes. The compiler then generates assembly code for each of these
passes and sends them to the hardware for rendering.

Our solution to MPP is based on a number of architectural
assumptions. We assume that architectures support only one
4-component output per pass. There may be many outstanding
values at a time, but the framebuffer can only store one
of them. Hence, intermediate results are usually preserved
by copying the framebuffer contents to texture memory.
Alternatively, render-to-texture can be used to avoid this expensive
framebuffer copy. In either case, we say that intermediate
results are saved. These values are restored in subsequent
rendering passes via texture fetches. This save-and-restore
technique relies on the following two assumptions:

•  Architectures preserve intermediate values properly. For
example, if the architecture uses floating point data types,
then it must also support floating point buffers and textures
to preserve high-precision intermediate results.

•  Given $k$, the branching factor of the DAG, architectures
support at least $k + 1$ operations, $k + 1$ texture units, $k$
registers, 1 vertex interpolant, and 1 level of texture
dependency per pass. This is the minimum set of resources
required to support an arbitrary DAG via multipass
rendering. For example, $k = 3$ if hardware instructions can
have at most 3 operands.

Note that even with these assumptions, multipass rendering
is an imperfect virtualization technique. It does not
produce correct results for overlapping transparent geometry.
This is a fundamental limitation of multipass rendering that
could be overcome with changes to hardware.

Multipass  rendering  creates  a  per-pass  overhead  cost.  This 
overhead  arises  from  re-execution  of  the  graphics  pipeline 
(including  transformation  and  rasterization  of  geometry), 
saving  the  results  of  a  pass  to  texture  memory,  and  restoring 
these  results  in  later  passes.  Furthermore,  extra  passes  may 
require  more  bounding  box-sized  or  viewport-sized  textures 
for  saving  intermediate  results;  these  large  textures  consume 
GPU  texture  memory. 

For  these  reasons,  RDS  attempts  primarily  to  minimize 
the  number  of  passes.  Given  a  DAG  of  fragment  operations, 
RDS  first  identifies  nodes  that  are  multiply-referenced.  RDS 
then  searches  over  these  nodes  and  uses  a  heuristic  to  decide 
if  the  subgraphs  rooted  at  these  nodes  should  be  saved  in  a 
separate  pass  or  recomputed.  It  packs  the  remaining  nodes 
into  as  few  passes  as  possible  using  a  greedy  bottom-up 
merging  algorithm.  RDS  optimizes  both  of  these  steps  by 
using  a  dominator  tree  to  group  sufficiently  small  regions  of 
the  DAG  together  into  a  single  pass.  Minimizing  the  number 
of  passes  in  this  manner  helps  to  minimize  the  overall  cost. 

3.  Algorithms 

We begin this section by formulating our problem and
describing a simple and optimal solution to MPP. However,
in practice this algorithm is intractable because it
exhaustively searches a large space of possible partitions. We then
identify the key issues of MPP by studying the structure of
the problem. Understanding these issues helps us closely
approximate the minimum cost without having to search
over the whole space of partitions. Finally, we propose an
$O(n \cdot g(n))$ algorithm called RDSh and an $O(n^2 \cdot g(n))$
algorithm called RDS, where $g(n)$ is the cost of checking if
a set of $n$ nodes can be mapped to one pass.

3.1. Preliminaries

We  can  represent  the  space  of  possible  partitions  by  marking 
nodes  to  indicate  pass  boundaries.  More  precisely,  we  mark  a 
node  if  and  only  if  the  node  is  the  root  of  a  pass.  The  number 
of  marked  nodes  equals  the  number  of  passes  in  the  partition. 
An  example  is  shown  in  Figure  2. 

The subregion of a node $v$ is the set of nodes including
$v$ and recursively all unmarked children of $v$. Figure 2b
shows the subregions that result from one choice of marked
nodes. A subregion is valid if it can be mapped to one pass;
otherwise it is invalid.

We now describe a simple and optimal solution to MPP:
examine the entire space of partitions, evaluate the cost of
each partition, and keep the one with the lowest cost.
Suppose there are $n$ nodes in the DAG. Since each node may
be marked or unmarked, there are $2^n$ unique ways to mark
the nodes; each of these yields a possible partition. This
exhaustive algorithm is clearly intractable because the partition
space grows exponentially with the size of the DAG.
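
As a minimal sketch of these definitions, the following Python code computes a subregion from a set of marked nodes and enumerates every marking. The dictionary representation of the DAG, the four-node example, and the function names are assumptions made for illustration; they do not come from the system described here.

```python
from itertools import product

def subregion(dag, node, marked):
    """Return `node` plus, recursively, all of its unmarked children."""
    region = {node}
    for child in dag[node]:
        if child not in marked:
            region |= subregion(dag, child, marked)
    return region

def all_markings(dag):
    """Yield every way to mark the nodes: 2**n markings for n nodes."""
    nodes = list(dag)
    for bits in product((False, True), repeat=len(nodes)):
        yield {v for v, bit in zip(nodes, bits) if bit}

# Hypothetical DAG: each node maps to the list of its children (operands).
dag = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}

region = subregion(dag, "a", marked={"a", "d"})
count = sum(1 for _ in all_markings(dag))
print(sorted(region), count)
```

Even for this four-node DAG there are 16 markings, which illustrates why the exhaustive approach does not scale.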


Figure 2: A DAG is shown in (a). We mark nodes to indicate
the tops of passes. For example, by marking three nodes,
we split the DAG into three passes as shown in (b). On the other
hand, by marking only two nodes, we split the DAG into two passes
as shown in (c), which causes two nodes to be recomputed.
Marked nodes are shaded, and recomputed nodes are squares.

3.2.  Greedy  Bottom-Up  Merging 

Since searching exhaustively is intractable, we propose a
greedy bottom-up algorithm that merges nodes into as few
passes as possible. We make a postorder traversal of the
DAG that applies the following steps at each node v:

Merge(node v)

    k ← the number of kids of v
    for i ← k down to 0
        do for each subset s of v's kids with i kids
            do try to merge v with all subregions of the kids in s
        if exactly one subset can be merged with v
            then pick that subset and stop
        else if two or more subsets can be merged with v
            then use MERGE heuristic to pick one

The  algorithm  is  greedy  in  the  sense  that  it  starts  from 
the  largest  possible  merge  and  only  considers  progressively 
smaller  subsets  when  necessary.  Sometimes,  there  is  more 
than  one  subset  of  a  given  size  that  can  be  merged.  For 
example,  suppose  a  node  has  two  children  and  that  it  can 
be  merged  with  either  the  left  child  or  the  right  child,  but 
not  both.  We  then  use  a  hardware-specific  heuristic  called 
MERGE to choose one of two possible merges. In principle,
MERGE should pick the subset of children that uses the
fewest resources, since this leaves the most room for additional
nodes to be merged with this pass. Sometimes it is
unclear which pass consumes the fewest resources. This can
occur, for example, if one pass uses 5 interpolants and 3 textures,
but another pass uses 3 interpolants and 5 textures. Our
implementation  of  MERGE  breaks  these  ties  arbitrarily. 
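
The merge step at a single node can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation used in the system: `valid` and `pick` are hypothetical callbacks standing in for the validity check and the MERGE heuristic, and subregions are represented as plain sets.

```python
from itertools import combinations

def merge(node, kid_regions, valid, pick):
    """Greedy merge step for one node.

    kid_regions: child -> current subregion of that child (a set of nodes).
    valid(nodes): True if the node set fits in one pass (assumed callback).
    pick(candidates): chooses one (subset, region) pair (MERGE heuristic).
    Returns the merged region and the kids left as roots of their own passes.
    """
    kids = list(kid_regions)
    for size in range(len(kids), -1, -1):           # largest subsets first
        feasible = []
        for subset in combinations(kids, size):
            region = {node}.union(*(kid_regions[k] for k in subset))
            if valid(region):
                feasible.append((subset, region))
        if feasible:
            chosen = feasible[0] if len(feasible) == 1 else pick(feasible)
            subset, region = chosen
            return region, set(kids) - set(subset)
    raise ValueError("node does not fit in a pass even on its own")

# Toy setting: a pass holds at most 3 nodes; ties go to the smaller region.
region, marked = merge(
    "a",
    {"b": {"b", "d"}, "c": {"c"}},
    valid=lambda nodes: len(nodes) <= 3,
    pick=lambda cands: min(cands, key=lambda c: len(c[1])),
)
print(region, marked)
```

In this toy case both children cannot be merged at once, each child can be merged individually, and the tie-breaking rule keeps the merge that leaves the most room in the pass.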

3.3.  Save  vs.  Recompute 

Some nodes in the DAG are referenced more than once; we
call these multiply-referenced (MR) nodes. Subregions of
these nodes may be saved or recomputed. For example, the
subregion of an MR node is saved in Figure 2b but
recomputed in Figure 2c. Always saving is undesirable because
each save creates an additional pass. However, always
recomputing is also undesirable, since it could lead to an
explosion in the number of recomputed operations. It is unclear
which choice will lead eventually to the best partition.
Intuitively, we should save if the subregion is “full” and
recompute if the subregion is “empty” relative to the architecture’s
available resources.

Both saving and recomputing involve a cost, but if we can
map all of the references to an MR node to a single pass, then
we can avoid these costs. We can detect these cases by
identifying the immediate dominators of MR nodes. Intuitively,
the immediate dominator $d$ of MR node $v$ is the node
“closest” to $v$ such that all the references of $v$ go through $d$. If
subregion($d$) and subregion($v$) can be mapped together to
a single pass, then we can avoid the save vs. recompute
decision for node $v$.

Since we are interested in MR nodes and their immediate
dominators, we would like to store them in a convenient
data structure. We use a data structure that we call a
partial dominator tree (PDT), which in turn is constructed from
the dominator tree of a DAG. In a dominator tree, the parent
of each node is its immediate dominator. This structure is a
tree as opposed to a DAG because each node except the root
has a unique immediate dominator. A PDT is obtained from
a dominator tree by discarding all nodes except MR nodes,
their immediate dominators, and the root. This construction
is illustrated in Figure 3.
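
The construction can be sketched in Python as shown below. The sketch assumes the DAG is given as a dictionary from each node to its children, that every node is reachable from the root, and that a simple ancestor-walking method is acceptable for finding immediate dominators; the function names and the example DAG are hypothetical and are not taken from the system described here.

```python
def immediate_dominators(dag, root):
    """Return (idom, parents) for a rooted DAG given as node -> children."""
    parents = {v: [] for v in dag}
    for v, kids in dag.items():
        for k in kids:
            parents[k].append(v)

    order, seen = [], set()
    def visit(v):                       # postorder depth-first traversal
        if v not in seen:
            seen.add(v)
            for k in dag[v]:
                visit(k)
            order.append(v)
    visit(root)
    order.reverse()                     # now every parent precedes its kids

    idom, depth = {root: None}, {root: 0}
    def common_ancestor(a, b):          # nearest common ancestor in the tree
        while a != b:
            if depth[a] < depth[b]:
                a, b = b, a
            a = idom[a]
        return a

    for v in order[1:]:
        d = parents[v][0]
        for p in parents[v][1:]:
            d = common_ancestor(d, p)
        idom[v], depth[v] = d, depth[d] + 1
    return idom, parents

def partial_dominator_tree(dag, root):
    """Keep only MR nodes, their immediate dominators, and the root."""
    idom, parents = immediate_dominators(dag, root)
    mr = {v for v in idom if len(parents[v]) > 1}
    keep = mr | {idom[v] for v in mr} | {root}
    pdt = {}
    for v in keep - {root}:
        a = idom[v]
        while a not in keep:            # skip over discarded nodes
            a = idom[a]
        pdt[v] = a
    return pdt, mr

dag = {"a": ["b", "c"], "b": ["d"], "c": ["d", "e"], "d": [], "e": []}
idom, _ = immediate_dominators(dag, "a")
pdt, mr = partial_dominator_tree(dag, "a")
print(idom, pdt, mr)
```

Here node `d` is the only multiply-referenced node, and both of its references pass through `a`, so the PDT consists of `d` with parent `a`.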


Figure 3: A DAG is shown in (a) with multiply-referenced nodes
drawn shaded. The dominator tree for this DAG is shown in (b).
The parent of each node in this tree is its immediate dominator,
and one node may be the immediate dominator of several nodes. The
partial dominator tree in (c) is obtained from the tree in (b) by
keeping only the multiply-referenced nodes, their immediate dominators,
and the root.


3.4.  RDSh 

In this section, we describe an algorithm called RDSh. We
apply the ideas of the previous section by mapping
multiply-referenced subregions and the subregions of their immediate
dominators to the same pass. Since this isn’t always possible,
we  break  the  problem  down  by  recursively  subdividing  the 
DAG  into  smaller  regions  using  the  PDT.  When  necessary, 
we  perform  greedy  bottom-up  merging  within  these  regions 
and  evaluate  save  vs.  recompute  decisions  using  a  heuristic. 


Pseudocode  for  the  algorithm  is  shown  below. 

RDSh(DAG G)

    r' ← the root node of the PDT of G
    Subdivide(r')

Subdivide(PDT node v')

1  if subregion(v) is invalid
2      then L ← the list of children of v' ordered to maintain
3               DAG dependencies
4           for each element c' of L
5               do Subdivide(c')
6                  if c is a MR node
7                      then use RECOMPUTE heuristic to decide
8                           if c should be saved or recomputed
9           apply greedy merging to subregion(v)

In this pseudocode, we use the following convention to
describe the relationship between nodes in the DAG $G$ and
its PDT $T$. If $v \in G$ and $v' \in T$, we refer to the node in $G$
as $v$ and refer to the node in $T$ as $v'$. In other words, $v$ and
$v'$ represent the same node, but in different structures. This
distinction is subtle but important. For example, the children
of $v$ are nodes in $G$, whereas the children of $v'$ are nodes
in $T$.

The  RDSh  algorithm  takes  an  unmarked  DAG  and  calls 
Subdivide  to  partition  it.  The  Subdivide  procedure  marks 
nodes  to  indicate  pass  boundaries  as  described  in  Section 
3.1.  The  algorithm  first  checks  at  each  node  if  its  subregion 
is  small  enough  to  map  to  one  pass  (line  1).  If  this  check 
fails,  the  problem  is  broken  down  into  recursive  subdivisions 
of  each  child  (line  5).  We  subdivide  children  in  an  order  that 
maintains  DAG  dependencies,  which  is  the  same  as  the  order 
given by a postorder traversal of the DAG. For example, in
the PDT shown in Figure 3c, two nodes are children of the
same parent. Since one of them always appears before the
other in a postorder traversal of the DAG shown in Figure 3a,
we subdivide that node first, then the other.

After  subdividing  each  child  that  is  multiply-referenced, 
we  use  a  heuristic  called  RECOMPUTE  to  decide  if  that 
child’s subregion should be saved or recomputed (lines 6-8);
saved children are marked. In principle, subregions that
use  few  resources  should  be  recomputed,  whereas  those  that 
consume  most  of  the  available  resources  should  be  saved. 
We  implement  RECOMPUTE  by  choosing  to  recompute  a 
set  of  nodes  if  and  only  if  the  consumption  of  each  resource 
is  less  than  one-half  the  maximum  allowed.  However,  this 
heuristic  can  be  replaced  with  one  that  is  more  specific  to  a 
given  architecture. 
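
The one-half rule can be sketched as follows. The resource names and the dictionary representation are assumptions for illustration; the limits are the Radeon 8500 figures quoted in Section 4.

```python
def should_recompute(usage, limits):
    """Recompute only if every resource is below half of its per-pass limit."""
    return all(usage.get(name, 0) < limit / 2 for name, limit in limits.items())

# Per-pass limits as reported for the Radeon 8500 back end in Section 4.
limits = {"operations": 16, "registers": 6, "textures": 6, "interpolants": 6}

small = {"operations": 5, "registers": 2, "textures": 2, "interpolants": 1}
large = {"operations": 9, "registers": 2, "textures": 2, "interpolants": 1}
print(should_recompute(small, limits), should_recompute(large, limits))
```

A subregion that uses 9 of the 16 available operations is already more than half full, so under this rule it is saved rather than recomputed.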

Finally,  after  all  children  have  been  subdivided,  we  apply 
the  greedy  merging  algorithm  described  in  Section  3.2  to  the 
current subregion (line 9). During this step, nodes that cannot
be merged with their parents are marked.

The Subdivide procedure makes only one traversal
through the PDT, but at each node it checks the validity of its
subregion. If $g(n)$ is the cost of this check, then the overall
running time of RDSh is $O(n \cdot g(n))$, where $n$ is the size
of the DAG. In our implementation, the target architecture’s
compiler makes each validity check by generating code and
allocating resources for the subregion. This can be done in
linear time, so our implementation of RDSh runs in $O(n^2)$
time.

3.5.  RDS 

RDSh  uses  a  simple  heuristic  to  make  save  vs.  recompute 
decisions.  However,  correct  decisions  are  difficult  to  make 
because they are interdependent. For example, suppose we
have two MR nodes $u$ and $v$ such that $u$ depends on $v$. If
we save $v$ to a separate pass, then the subregion of $u$
becomes smaller and may be worth recomputing. On the other
hand, if we recompute $v$, then the subregion of $u$ becomes
larger and may be worth saving. It is difficult for a simple
heuristic to predict which of these two choices will eventually
lead to the minimum-cost partition.

One way to address this problem is to search over the MR
nodes exhaustively and try all possible save/recompute
configurations. However, this increases the overall running time
by a factor of $2^m$, where $m$ is the number of MR nodes in
the DAG. Instead, we propose a less expensive alternative
called  RDS  that  combines  a  limited  search  with  the  existing 
RECOMPUTE  heuristic. 

The  pseudocode  for  RDS  is  shown  below. 

RDS(DAG G)

    r' ← the root node of the PDT of G
    Search(G, r')

Search(DAG G, PDT node r')

1  L ← list of the MR nodes of G ordered to maintain
2       DAG dependencies
3  for each node v in L
4      do unmark all nodes of G
5         fix v as marked            # save subregion(v)
6         P_save ← the partition computed by Subdivide(r')
7         C_save ← COST(P_save)
8
9         unmark all nodes of G
10        fix v as unmarked          # recompute subregion(v)
11        P_recompute ← the partition computed by Subdivide(r')
12        C_recompute ← COST(P_recompute)
13        if C_save < C_recompute then fix v as marked

We also replace lines 6-8 of the Subdivide procedure with:

14  if c is a MR node
15      then if c is fixed as marked, then mark c
16      else if c is fixed as unmarked, then unmark c;
17      else use RECOMPUTE heuristic to decide if
18           c should be saved or recomputed;

In the Search procedure, all MR nodes are initially
unfixed. Each iteration of the loop makes a save vs. recompute
decision at just one MR node. At each node $v$, we
use the Subdivide algorithm to produce two partitions: one
that results when subregion($v$) is saved (line 6), and
another that results when it is recomputed (line 11). In other
words, in each case we have already determined whether
subregion($v$) will be saved or recomputed before subdivision
begins. The code represents this by fixing $v$ as marked
or unmarked (lines 5 and 10). In both cases, we use the
RECOMPUTE heuristic described above to make save vs.
recompute decisions at the remaining unfixed MR nodes (lines
17-18). We then evaluate both partitions using a given cost
model (lines 7 and 12) and decide to save or recompute $v$
based on which partition has the lower cost (line 13).
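
The control flow of this limited search can be sketched in Python as follows. The sketch treats subdivision and cost evaluation as black boxes: `subdivide` and `cost` are hypothetical callbacks, and the toy versions used in the example exist only to exercise the loop.

```python
def search(mr_nodes, subdivide, cost):
    """Fix a save vs. recompute decision at each MR node, one at a time.

    mr_nodes: MR nodes ordered to maintain DAG dependencies.
    subdivide(fixed): partition computed under the fixed decisions, where
        fixed maps node -> True (save) or False (recompute); assumed callback.
    cost(partition): cost of a partition under the cost model; assumed callback.
    """
    fixed = {}
    for v in mr_nodes:
        save_cost = cost(subdivide({**fixed, v: True}))
        recompute_cost = cost(subdivide({**fixed, v: False}))
        fixed[v] = save_cost < recompute_cost   # keep the cheaper choice
    return fixed

# Toy stand-ins: a save costs one pass (10); a recompute costs the node's size.
sizes = {"x": 4, "y": 25}
calls = []
def toy_subdivide(fixed):
    calls.append(dict(fixed))
    return fixed
def toy_cost(partition):
    return sum(10 if saved else sizes[v] for v, saved in partition.items())

decisions = search(["x", "y"], toy_subdivide, toy_cost)
print(decisions, len(calls))
```

The loop evaluates two partitions per MR node, so the number of subdivisions grows linearly with the number of MR nodes rather than exponentially.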

The RDS algorithm wraps a search around the Subdivide
procedure. The search calls Subdivide $2m$ times, where $m$
is the number of MR nodes. By the analysis in the previous
section, subdivision has complexity $O(n \cdot g(n))$. Since $m$ is
typically proportional to $n$, the overall running time of RDS
is $O(n^2 \cdot g(n))$.

3.6.  Analysis 

We have described two versions of an algorithm. The first
one, RDSh, uses only heuristics to make merging and
recompute decisions and has complexity $O(n \cdot g(n))$, where
$g(n)$ is the cost of the validity check on a subregion of size
$n$. In our experiments, we found that a simple heuristic is
inadequate for making save vs. recompute decisions. Thus we
described a second version, RDS, that uses a limited search
to make these decisions but has complexity $O(n^2 \cdot g(n))$.

In our implementation, a validity check involves
generating code and allocating resources for the subregion. The
check has complexity $g(n) = O(n)$, so RDSh and RDS
have complexity $O(n^2)$ and $O(n^3)$, respectively. The actual
running time depends on the resource consumption of the
given shader and the resource constraints of the target
architecture. When resources are extremely limited, the validity
check  in  line  1  of  the  Subdivide  procedure  fails  most  of 
the  time,  which  leads  to  further  traversal  of  the  PDT.  On  the 
other  hand,  when  resources  are  less  constrained,  the  validity 
check  usually  succeeds  and  terminates  the  recursion. 
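
The running times quoted in this section can be combined in one short derivation. Here $n$ denotes the size of the DAG, $m$ the number of MR nodes, and $g(n)$ the cost of one validity check; the substitutions $g(n) = O(n)$ and $m = O(n)$ are the ones stated in the text, and the symbol names themselves are chosen for this illustration.

```latex
\begin{align*}
T_{\mathrm{RDSh}}(n) &= O\bigl(n \cdot g(n)\bigr) = O(n \cdot n) = O(n^2), \\
T_{\mathrm{RDS}}(n)  &= 2m \cdot O\bigl(n \cdot g(n)\bigr)
                      = O\bigl(n^2 \cdot g(n)\bigr) = O(n^3).
\end{align*}
```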

4.  Implementation 

In the rest of this paper, we focus on RDS. We implemented
RDS and integrated it with the Stanford Real-Time
Programmable Shading System. The integrated system is
shown  in  Figure  1. 

We evaluated this system by developing two fragment
compiler back ends. The first one targets the ATI Radeon
8500 architecture. This architecture exposes a custom set
of OpenGL extensions. We queried the software driver
using these extensions to determine the hardware’s resource
constraints. The hardware is limited to 16 operations, 6
registers, 6 texture units, 6 interpolants, and 1 level of texture
dependency. Since the hardware provides one output value
per pass, we use render-to-texture to spill intermediate
values to texture memory. However, floating point buffers and
textures are unsupported, so these values are not preserved at
full precision. To circumvent this issue, we limited our tests
on this architecture to shaders whose intermediate values are
already clamped to the range [0, 1].

The Radeon 8500 is one example of a programmable fragment
architecture. To evaluate RDS on architectures with
different resource constraints, we wrote a second back end
that compiles shaders to a generic programmable fragment
pipeline. The maximum number of per-pass operations,
registers, active texture units, and interpolants can be configured;
we call each setup a pipeline configuration (PC). Our
simulated architecture supports 4-component floating-point
vectors and one output value. It uses the NVIDIA vertex
program instruction set, with the addition of texture operations.
The architecture places no limits on the number of texture
dependencies within a fragment program. In short, each PC
provides the basic data types and instruction set necessary
to support fragment shaders, but imposes four per-pass
resource constraints.

Both  compiler  back  ends  incorporate  the  following  three 
elements  to  support  RDS: 

1. Before partitioning, the compilers perform instruction
selection to build a DAG of fragment operations.

2. To support partitioning, the compilers provide a common
interface to RDS that exposes a cost model and hardware-
specific resource constraints.

3. After partitioning, the compilers order the passes and
assign them to textures.

To perform instruction selection, both of our back ends
use a modified version of Iburg. We extended Iburg to
support operators with arbitrary arity and wrote covering rules
to map the shading system’s intermediate representation
directly to hardware operations. Note that unlike Peercy and
Proudfoot, we only used Iburg’s tree-matching capabilities
for selecting operations, not for mapping operations to
rendering passes.

The  interface  between  RDS  and  the  back  ends  consists  of 
four  callback  functions: 

1. VALID, a function that determines if a given set of nodes
can be mapped to a single pass. This is needed to ensure
that each pass satisfies the hardware’s resource constraints.
Our back ends implement this check by generating
code and allocating resources as needed. The check
fails if any part of the resource allocation fails.

2. COST, a function that computes the cost of a given
partition using a cost model. It is called during the search
algorithm described in Section 3.5. The cost of a partition
depends on several factors, including the number of
passes $p$, the number of texture accesses $t$, and the number
of non-texture fragment instructions $i$. We use a simple
linear cost model:

$\text{cost} = p \cdot c_p + t \cdot c_t + i \cdot c_i$

Note that the costs $c_t$ and $c_i$ are charged on a
per-fragment basis, so the total cost of texture accesses and
non-texture instructions is proportional to the number of
rendered fragments. On the other hand, $c_p$ is the overhead
of an entire pass, so we can think of it as the per-pass
cost amortized over all the fragments. For our ATI back
end, the multipass renderer preserves intermediate results
by saving the entire framebuffer with render-to-texture.
Saves can also be implemented using copy-to-texture, in
which case $c_p$ depends on the size of the viewport.

3.  RECOMPUTE,  a  heuristic  function  that  decides  whether 
or  not  to  recompute  a  set  of  nodes.  This  is  called  during 
the  search  algorithm  described  in  Section  3.5.  In  our  back 
ends,  we  choose  to  recompute  a  set  of  nodes  if  and  only 
if  the  consumption  of  each  resource  is  less  than  one-half 
the  maximum  allowed. 

4.  MERGE,  a  heuristic  function  that,  given  a  set  of  passes, 
picks  the  one  that  consumes  the  fewest  resources.  This 
is  called  during  the  merging  algorithm  described  in  Sec¬ 
tion  3.2.  Sometimes  it  is  unclear  which  pass  consumes 
the  fewest  resources;  our  back  ends  break  these  ties  arbi¬ 
trarily. 

RDS  uses  these  callback  functions  to  query  a  compiler  back 
end  for  hardware-specific  information.  This  design  allows 
RDS  to  target  any  architecture  whose  compiler  back  end  im¬ 
plements  these  callbacks.  Furthermore,  it  requires  minimal 
changes  to  our  existing  compiler  infrastructure. 
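
The four callbacks can be pictured as a small abstract interface. The following is a minimal sketch in Python, not the authors' implementation: the names `BackEnd` and `ToyBackEnd`, the method signatures, the two-resource `limits` dictionary, and the convention that texture nodes start with "T" are all assumptions made for illustration. Only the roles of the callbacks, the linear cost model, and the one-half recompute threshold come from the text.

```python
from abc import ABC, abstractmethod
from typing import Dict, FrozenSet, List

Pass = FrozenSet[str]            # a pass is modelled as a set of DAG node names


class BackEnd(ABC):
    """Hypothetical interface: the four callbacks RDS asks a back end for."""

    @abstractmethod
    def valid(self, nodes: Pass) -> bool: ...

    @abstractmethod
    def cost(self, partition: List[Pass]) -> float: ...

    @abstractmethod
    def recompute(self, nodes: Pass) -> bool: ...

    @abstractmethod
    def merge(self, candidates: List[Pass]) -> Pass: ...


class ToyBackEnd(BackEnd):
    """Toy back end: every node is one operation; texture nodes start with 'T'."""

    def __init__(self, limits: Dict[str, int], c_p: float, c_t: float, c_i: float):
        self.limits, self.c_p, self.c_t, self.c_i = limits, c_p, c_t, c_i

    def usage(self, nodes: Pass) -> Dict[str, int]:
        tex = sum(1 for n in nodes if n.startswith("T"))
        return {"ops": len(nodes), "tex": tex}

    def valid(self, nodes):
        # A pass is valid if no resource exceeds its limit.
        return all(v <= self.limits[k] for k, v in self.usage(nodes).items())

    def cost(self, partition):
        # Linear model: per-pass term plus per-texture and per-instruction terms.
        t = sum(self.usage(p)["tex"] for p in partition)
        i = sum(self.usage(p)["ops"] - self.usage(p)["tex"] for p in partition)
        return self.c_p * len(partition) + self.c_t * t + self.c_i * i

    def recompute(self, nodes):
        # Recompute iff every resource uses less than half of its maximum.
        return all(2 * v < self.limits[k] for k, v in self.usage(nodes).items())

    def merge(self, candidates):
        # Fewest total resources wins; ties are broken arbitrarily by min().
        return min(candidates, key=lambda p: sum(self.usage(p).values()))
```

A partitioner written against `BackEnd` never inspects hardware limits directly, which is the property the design above relies on.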

After  partitioning  is  complete,  compiler  back  ends  must 
order  the  passes  and  assign  textures  to  store  intermediate  re¬ 
sults.  Ideally,  these  textures  should  be  assigned  in  a  way  that 
minimizes  the  number  of  textures  needed.  Assigning  tex¬ 
tures  is  similar  to  register  allocation,  so  we  applied  graph 
coloring  techniques  as  described  by  Chaitin^. 
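
To illustrate the analogy with register allocation, here is a minimal sketch that assigns textures by greedy coloring of an interference graph. It is a simplification, not Chaitin's full algorithm and not necessarily what the compiler does: it assumes the passes are already ordered, that each intermediate result has a live range from the pass that writes it to the last pass that reads it, and that a texture cannot be read and written in the same pass. The function name `assign_textures` and the input format are hypothetical.

```python
from typing import Dict, Tuple


def assign_textures(live: Dict[str, Tuple[int, int]]) -> Dict[str, int]:
    """Map each intermediate result to a texture index.

    live[name] = (producer pass, last consumer pass). A result occupies its
    texture from the pass that writes it through the last pass that reads it.
    """
    def interferes(a: str, b: str) -> bool:
        (a0, a1), (b0, b1) = live[a], live[b]
        return a0 <= b1 and b0 <= a1      # closed intervals overlap

    assignment: Dict[str, int] = {}
    # Colour in order of production, always taking the lowest free texture.
    for name in sorted(live, key=lambda n: live[n]):
        taken = {assignment[o] for o in assignment if interferes(name, o)}
        tex = 0
        while tex in taken:
            tex += 1
        assignment[name] = tex
    return assignment


# Four intermediate results; r3 can reuse the texture freed by r0.
print(assign_textures({"r0": (0, 2), "r1": (1, 2), "r2": (2, 3), "r3": (3, 4)}))
```

In this example three textures suffice for four intermediate results, because the live range of `r3` begins after `r0` and `r1` are dead.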

5.  Results 

5.1.  Example  and  System  Demonstration 

We  now  demonstrate  how  our  ATI  Radeon  8500  compiler 
uses  RDS  to  partition  a  shader  into  multiple  passes  so  that  it 
can  be  rendered  in  real-time.  Our  shader  is  a  version  of  the 
RenderMan  bowling  pin  surface  shader  combined  with  five 
light  shaders:  one  point  light  source  and  four  animated  tex¬ 
tured  lights.  Figure  4  shows  the  source  code  for  this  shader 
written  in  the  Stanford  Real-Time  Shading  Language.  The 
compiler  front  end  maps  this  source  code  to  a  hardware- 
independent  intermediate  representation.  Next,  our  Radeon 
8500  back  end  performs  instruction  selection  and  builds  the 
DAG  in  Figure  5a.  Since  this  DAG  is  too  large  to  map  to 
a  single  pass,  the  compiler  calls  RDS  to  partition  the  DAG 
into multiple passes. Figure 5b shows the resulting partition,
which contains 7 passes, 12 texture fetches, and 30 non-texture
operations. We ran the partitioned shader at a resolution
of [illegible] on a 1.4 GHz Pentium 4 system with
an ATI Radeon 8500. The system renders the shader at 30
frames/sec and produces the image shown in Figure 7 (see
color plates).

5.2.  Testing  Methodology 

In  this  section,  we  discuss  the  techniques  that  we  used  to 
evaluate  the  efficiency  of  RDS.  We  chose  three  shaders, 
seven  pipeline  configurations  of  our  simulated  architecture, 
and five cost models.

Figure 4: RTSL source code for the bowling pin surface shader.

Data  for  the  shaders  are  listed  in  Table  1.  The  first  shader 
(RBP)  is  the  version  of  the  RenderMan  bowling  pin  de¬ 
scribed  above.  RBP  has  modest  computation  but  requires 
several  vertex  interpolants  and  texture  units.  The  second 
shader  (BMBP)  applies  a  bump  map  to  the  bowling  pin  sur¬ 
face  and  uses  only  one  point  light  source.  BMBP  uses  a  more 
balanced  set  of  resources  than  RBP.  The  third  shader  (Wood) 
procedurally  generates  a  wood  texture”^.  It  is  computationally 
intensive  and  requires  many  dependent  texture  lookups,  but 
uses  other  resources  modestly.  We  chose  these  three  shaders 
for  testing  because  they  stress  different  resources. 

Eight pipeline configurations are listed in Table 2. PCs 1-4
are limited in only one resource; these allow us to show
that  RDS  finds  good  partitions  on  architectures  constrained 
in  different  ways.  PCs  5-7  have  more  balanced  constraints; 
they  represent  evolving  architectures  that  provide  more  and 
more  resources  to  fragment  pipelines.  We  use  PCs  5-7  to 
show  that  RDS  performs  well  when  multiple  resources  are 
constrained.  PC  8  has  unlimited  resources  and  is  useful  for 
comparing  the  cost  of  a  single-pass  partition  to  the  cost  of 
multiple  passes. 

For convenience, we will use the notation $O/R/T/I$
to refer to architectures’ resource constraints, where $O$ is the
number of operations, $R$ is the number of registers, $T$ is the
number of texture units, and $I$ is the number of interpolants.
For instance, PC 5 has constraints 6/4/4/4.
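
To make the four-part constraint notation concrete, the sketch below checks whether a shader's resource requirements fit within a configuration's limits, using the figures from Tables 1 and 2. The names `Resources` and `fits` are hypothetical, and the check covers only the four listed resource counts, so it should be read as a necessary condition for single-pass compilation rather than a full validity test.

```python
import math
from typing import NamedTuple


class Resources(NamedTuple):
    operations: float
    registers: float
    textures: float
    interpolants: float


def fits(need: Resources, limit: Resources) -> bool:
    """True if every requirement is within the corresponding limit."""
    return all(n <= l for n, l in zip(need, limit))


INF = math.inf
SHADERS = {                      # last four rows of Table 1
    "RBP": Resources(36, 7, 9, 24),
    "BMBP": Resources(36, 5, 3, 6),
    "Wood": Resources(208, 16, 6, 5),
}
CONFIGS = {                      # selected rows of Table 2
    "PC 5": Resources(6, 4, 4, 4),
    "PC 7": Resources(128, 12, 16, 12),
    "PC 8": Resources(INF, INF, INF, INF),
}

for shader, need in SHADERS.items():
    single = [pc for pc, limit in CONFIGS.items() if fits(need, limit)]
    print(shader, "fits in one pass on:", single)
```

For example, RBP exceeds PC 7 only in interpolants (24 needed, 12 available), which is why it still needs multiple passes on that configuration.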

Figure 5: (a) DAG of the RBP shader, after instruction selection.
(b) A partition of the shader with 7 passes, computed by RDS for
the ATI Radeon 8500 architecture. Texture fetches (T), vertex interpolants
(V), and constants (C) are shown as diamonds, triangles and
squares, respectively. Circles represent other fragment operations,
and multiply-referenced nodes are shaded. Dotted edges indicate
dependencies between passes.

We chose a simple, linear cost function $c_p \, p + c_t \, t + c_i \, i$,
as discussed in Section 4, where $p$, $t$, and $i$ count passes,
texture accesses, and non-texture fragment instructions. For the
ATI Radeon 8500, we estimated the coefficients for each term by
profiling the hardware as follows. We performed all our measurements
by rendering a screen-aligned square into a [illegible] window.
First, we compared the rendering times of shaders that differed
only in the number of non-texture fragment instructions.
We then computed $c_i = \Delta\tau / (f \, \Delta i)$, where $f$ is the
number of fragments, $\Delta i$ is the controlled change in the
number of non-texture instructions, and $\Delta\tau$ is the measured
change in rendering time. Our measurements of $c_i$ remained
constant when we varied $f$ by resizing the square, as expected.
We computed $c_t$ similarly. To estimate $c_p$, we reduced
the square to one pixel in size to make the cost of
fragment operations negligible. After measuring the three
coefficients, we normalized them by $c_i$ to obtain:

$$\mathrm{cost} = [c_p / c_i] \, p + [c_t / c_i] \, t + i$$

where the bracketed ratios stand for the measured numeric values,
which are illegible in this copy.

While  the  result  is  a  rough  estimate,  it  is  clear  that  per-pass 
overhead  is  high  relative  to  individual  fragment  operations. 
This  supports  our  assumption  that  per-pass  overhead  domi¬ 
nates  the  overall  cost. 
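
The profiling arithmetic can be sketched as follows. All timing values and the fragment count below are hypothetical placeholders chosen only to exercise the formulas; they are not measurements from the paper. The helper names are assumptions, and dividing the one-pixel pass time by the fragment count is one reading of "per-pass cost amortized over all the fragments".

```python
def per_fragment_coeff(delta_time: float, fragments: int, delta_count: int) -> float:
    """Cost of one extra instruction (or texture access) on one fragment."""
    return delta_time / (fragments * delta_count)


def normalized_model(c_p: float, c_t: float, c_i: float):
    """Return cost(p, t, i) with all coefficients divided by c_i."""
    k_p, k_t = c_p / c_i, c_t / c_i
    return lambda p, t, i: k_p * p + k_t * t + i


# Hypothetical measurements (NOT from the paper), times in seconds.
fragments = 256 * 256
c_i = per_fragment_coeff(delta_time=0.0004, fragments=fragments, delta_count=8)
c_t = per_fragment_coeff(delta_time=0.0006, fragments=fragments, delta_count=8)
pass_time = 0.0005               # time of one pass drawn into a one-pixel square
c_p = pass_time / fragments      # per-pass overhead amortized over the fragments

cost = normalized_model(c_p, c_t, c_i)
# Evaluate the model on the 7-pass, 12-fetch, 30-operation partition of Section 5.1.
print(cost(p=7, t=12, i=30))
```

Because the model is normalized by the per-instruction coefficient, a partition's cost reads directly as "equivalent non-texture instructions per fragment".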

For  our  simulated  architecture,  we  also  tested  RDS  with 
the  following  cost  models: 


Shader                    RBP   BMBP   Wood
# total nodes ($n$)        68     72    475*
# MR nodes ($m$)           11     14     72*
Non-texture operations     27     33    175*
Texture operations          9      3     33*
Total operations           36     36    208*
Registers                   7      5     16*
Texture units               9*     3      6
Interpolants               24*     6      5

Table 1: Shaders. Resource consumption on our simulated architecture
is listed for each shader. The shader with the largest consumption
in each category is marked with an asterisk. The last four rows
contain the primary resource constraints. For example, in order for
RBP to compile to a single pass, our pipeline configuration must
support at least 36 operations, 7 registers, 9 texture units, and 24
interpolants.


Architecture                   O     R     T     I
PC 1 (operation-limited)       6     ∞     ∞     ∞
PC 2 (register-limited)        ∞     4     ∞     ∞
PC 3 (texture-limited)         ∞     ∞     4     ∞
PC 4 (interpolant-limited)     ∞     ∞     ∞     4
PC 5                           6     4     4     4
PC 6                          24     8     8     8
PC 7                         128    12    16    12
PC 8 (unlimited)               ∞     ∞     ∞     ∞

Table 2: Architectures. Four resource constraints are given for each
architecture: operations ($O$), registers ($R$), texture units ($T$), and
vertex interpolants ($I$).

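[Five linear cost models in passes $p$, texture accesses $t$, and
non-texture instructions $i$ are listed here; the coefficients of the
first three are illegible in this copy, and the last two read as
$p + t + i$ and $t + i$.]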

We picked different coefficients to show that RDS performs
well for architectures with different performance characteristics.
Note that the last cost model, $t + i$ (texture accesses plus
non-texture instructions), completely ignores per-pass overhead.

5.3.  Efficiency 

For comparison, we implemented the algorithm in Section
3.1 that uses exhaustive search to find the optimal partition.
It  is  interesting  to  note  that  in  all  of  our  tests,  RDS  always 
found  a  partition  with  the  same  number  of  passes  as  the  opti¬ 
mal  partition.  However,  there  is  a  difference  in  cost  because 
RDS  picked  different  pass  boundaries,  which  can  affect  the 
total  number  of  restore  operations  needed. 

Table 3 compares the partitions computed by RDS using
the cost model [illegible] to the optimal partitions. In some
cases, the search space was too large for the exhaustive algorithm
to finish. We measure the efficiency of RDS by computing
the percentage increase in cost of the RDS-generated
partition over the optimal partition. There are 17 complete
test cases not including PC 8. RDS found a minimum-cost
partition in 14 of these cases, and was within 5% of optimal
cost in each of the remaining cases.
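
The efficiency measure is a simple relative difference. A minimal sketch, with a hypothetical function name and made-up cost values used only to show the arithmetic:

```python
def percent_over_optimal(rds_cost: float, optimal_cost: float) -> float:
    """Percentage increase in cost of the RDS partition over the optimum."""
    return 100.0 * (rds_cost - optimal_cost) / optimal_cost


# Hypothetical (rds_cost, optimal_cost) pairs, NOT taken from Table 3.
cases = [(100.0, 100.0), (105.0, 100.0), (82.0, 80.0)]
increases = [percent_over_optimal(r, o) for r, o in cases]
optimal_hits = sum(1 for x in increases if x == 0.0)
print(increases, "optimal in", optimal_hits, "of", len(cases), "cases")
```

The same measure, with RDS in place of the optimum, is used when comparing RDSh against RDS.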

Results for the other cost models are similar. RDS found
an optimal partition in two-thirds of all the test cases. For the
remaining cases, RDS was within 5% of optimal on average
and within 15% in the worst case. The worst cases occurred
when using the cost models $p + t + i$ and $t + i$ (passes $p$,
texture accesses $t$, non-texture instructions $i$), in which per-pass
overhead carries little or no weight. This is not surprising,
since RDS was designed under the assumption that
per-pass overhead dominates the overall cost. Nonetheless,
these results indicate that RDS performs consistently well
across different cost models.

5.4. Speed vs. Quality Tradeoff

In  practice,  one  can  imagine  a  knob  that  allows  users  to 
trade  off  speed  for  partition  quality.  As  an  example,  we  com¬ 
pare  RDS  to  RDSh  in  Table  4.  We  measure  the  efficiency 
of  RDSh  as  a  percentage  increase  in  cost  over  RDS.  Over 
the  21  cases,  the  average  cost  of  RDSh  is  10.5%  higher. 
Since  RDSh  uses  only  heuristics  and  RDS  performs  a  lim¬ 
ited  search,  it  is  not  surprising  that  RDSh  runs  faster  than 
RDS  at  the  expense  of  quality. 

Both  versions  of  the  algorithm  may  be  useful  in  practice. 
For  example,  a  fast  partitioning  scheme  such  as  RDSh  can  be 
used during iterative development of new shaders. Once the
shader is complete, a slower but more thorough algorithm
such as RDS can be used to produce efficient partitions, which in
turn will improve rendering performance.

5.5.  Cost  Analysis 

In this section, we study the cost of saves and restores relative
to the total cost of a partition. Table 5 compares two
partitions of the RBP shader. The 11-pass partition contains
53 operations, of which 11 are restores. In contrast, the 3-pass
partition contains 44 operations, of which only 2 are
restores. Thus all 9 of the extra operations in the former partition
are due to restores. This overhead occurs because the
partition simply requires more intermediate values.
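
As a quick consistency check on these figures (a derived calculation, not one stated in the source), subtracting the restores from each total leaves the same number of non-restore operations in both partitions:

$$53 - 11 = 42 = 44 - 2, \qquad 53 - 44 = 9 = 11 - 2,$$

so the entire difference between the two partitions is accounted for by restores.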

As  the  number  of  passes  increases,  the  additional  costs 
come  primarily  from  save  and  restore  overhead.  Figure  6 
shows  the  cost  breakdown  for  seven  partitions  of  the  RBP 
shader,  each  generated  by  RDS.  These  partitions  are  the 
same as the ones listed in Table 3, but they are now arranged
in order of increasing number of passes. The costs
of non-texture operations and non-restore texture fetches remain
nearly constant across the graph; the slight variations
arise  from  recomputation.  In  contrast,  the  cost  of  saves  and 
restores  rises  steadily.  This  suggests  that  rendering  perfor¬ 
mance  could  be  improved  by  changes  to  hardware  that  use 
more  efficient  methods  for  saving  and  restoring  intermediate 
results. 

6.  Discussion 

In this section we discuss limitations of RDS and future
work.

Table 4: RDS vs. RDSh. Running times for both versions of the algorithm were measured on a 1.4 GHz Pentium 4 system running Windows
2000 SP2. All times are reported in seconds. The efficiency of RDSh is computed as a percentage increase in cost over RDS.

Table 5: Per-pass resource consumption for two partitions of the
RBP shader. A separate column gives the number of restores. The 11-pass
and 3-pass partitions target PCs 5 and 7, respectively. Both partitions
were computed by RDS.

In  designing  RDS,  we  have  assumed  that  a  shader  can 
be  represented  as  a  single  DAG,  which  is  equivalent  to 
one  basic  block  of  hardware  assembly  code.  However,  pro¬ 
grammable  graphics  hardware  will  likely  support  loops  and 
conditionals  in  the  future.  Since  this  requires  flowgraphs 
with  multiple  basic  blocks,  we  need  additional  techniques 
to  handle  branching  correctly  and  efficiently. 

We have assumed that programmable graphics hardware
allows only one output value per pass. However, future hardware
may support multiple outputs, so a single pass can compute
several results. These computations may be unrelated
and correspond to disjoint regions of the DAG. Efficient partitioning
becomes more difficult because it requires examining
many disjoint regions simultaneously. Our current algorithm
cannot handle multiple outputs because it considers
only one connected region at a time.

Figure 6: Cost of partitions, broken down by type (horizontal axis:
number of passes). These partitions were computed by RDS for the
RBP shader with the cost model [illegible].

Accurate  cost  models  are  needed  to  enable  RDS  to  find 
partitions  that  render  efficiently.  For  that  reason,  RDS  al¬ 
lows  any  cost  model  to  be  plugged  in.  However,  we  make 
some simplifying assumptions in our cost model. Some of
these assumptions are necessary because we don’t have all
the relevant cost information at compile time. For example,
the per-pass cost may depend on the viewport size, and the
total cost of instructions depends on the number of rendered
fragments. In our shading system, however, neither the viewport
size nor the number of fragments is known at compile
time, so we simply treat the cost-model coefficients $c_p$, $c_t$,
and $c_i$ (per pass, per texture access, and per non-texture
instruction) as constants. On the other hand, some limitations can be addressed
with  more  sophisticated  cost  models.  For  example,  our  cur¬ 
rent  model  ignores  the  cost  of  vertex  shaders.  If  rendering 
performance  is  limited  by  vertex  computations,  then  it  is  de¬ 
sirable  to  find  a  partition  that  minimizes  recomputation  of 
regions  containing  vertex  inputs.  This  could  be  done  by  us¬ 
ing  a  cost  model  in  which  the  per-pass  cost  is  proportional 
to  the  number  of  vertex  operations  needed  by  that  pass. 

We showed that RDS and RDSh are $O(n m \cdot g(n))$ and
$O(n \cdot g(n))$ algorithms, where $n$ is the number of DAG nodes,
$m$ is the number of multiply-referenced nodes, and $g(n)$ is the
cost of a validity check, but it may be possible to eliminate
the $g(n)$ term. For simplicity and ease of implementation,
we check validity by calling existing compiler subroutines.
However, this requires that subregions be compiled from
scratch every time, so $g(n) = O(n)$. The check could be
made less expensive by using incremental techniques, since
a subregion depends only on previously checked subregions.
One approach would be to use vectors to keep track of a subregion’s
resource consumption. Resource vectors from different
subregions could be added to determine quickly if the
subregions can be merged. However, overlapping subregions
and peephole optimizations are tricky and must be treated
carefully. Using incremental approaches, it may be possible
to reduce the cost of the validity check to constant time when
amortized over the traversal of the entire DAG. This would
reduce the complexity of RDS and RDSh by a factor of $n$.
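
The resource-vector idea can be sketched as below. This is an illustration under simplifying assumptions, not a worked-out design: the four-component vector layout, the function names, and the sample values are hypothetical, and plain addition double-counts nodes shared by overlapping subregions and ignores peephole optimizations, so it gives only a conservative upper bound on the merged region's consumption.

```python
from typing import Tuple

# (operations, registers, texture units, interpolants)
Vector = Tuple[int, int, int, int]


def add(a: Vector, b: Vector) -> Vector:
    """Component-wise sum of two resource vectors."""
    return tuple(x + y for x, y in zip(a, b))


def can_merge(a: Vector, b: Vector, limit: Vector) -> bool:
    """Conservative check: the summed consumption must fit within the limits."""
    return all(s <= l for s, l in zip(add(a, b), limit))


PC6 = (24, 8, 8, 8)                  # limits of PC 6 from Table 2
left, right, extra = (10, 3, 2, 4), (12, 4, 3, 3), (5, 2, 4, 2)

print(can_merge(left, right, PC6))               # two subregions together
print(can_merge(add(left, right), extra, PC6))   # then a third one
```

Each check touches a fixed number of components regardless of subregion size, which is what would make the validity test constant-time if the vectors can be maintained incrementally.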

7.  Conclusion 

We  have  described  RDS,  an  algorithm  for  partitioning  frag¬ 
ment  shaders  into  multiple  passes.  Our  algorithm  finds  near- 
optimal  partitions  for  a  number  of  shaders  on  architectures 
with  different  limitations.  Furthermore,  RDS  performs  con¬ 
sistently  well  across  a  range  of  different  cost  models.  We 
have  integrated  RDS  with  an  existing  programmable  shad¬ 
ing  system  and  demonstrated  how  this  system  can  be  used  to 
partition  and  render  a  large  shader  in  real-time.  Since  RDS 
depends  only  on  a  flexible  cost  model  and  a  set  of  resource 
constraints,  it  can  be  readily  applied  to  future  programmable 
graphics architectures.

Figure 7: RenderMan bowling pin with 1 point light source and 4 animated textured lights. Since the shader is too large to map to a single
pass, our RDS algorithm splits it into multiple passes. The pin renders at 30 frames/sec on a 1.4 GHz Pentium 4 system with an ATI Radeon
8500 graphics card.